CCDL 2020

Objective for this notebook analysis:

We’ll use the same gene expression dataset we used in the previous notebook. It a pre-processed astrocytoma microarray dataset that we performed differential expression analyses.

More tidyverse resources:
- R for Data Science - Tidyverse documentation - Cheatsheet of Tidyverse data transformation - Online Tidyverse book chapter

Concepts to cover (in order) - Referencing a library’s function with :: - dplyr::select - dplyr::filter - dplyr::mutate - apply functions? - Writing to RDS/TSV

Set Up

Declare the name of the directory where we will read in the data.

data_dir <- "data"

The tidyverse is a collection of packages that are handy for general data wrangling, analysis, and visualization. Other packages that are specifically handy for different biological analyses are found on Bioconductor. If we want to use a package’s functions we first need to install them. In our Docker container, we already have tidyverse and other packages we will use in this workshop installed for you. But if you needed to install it or other packages available on CRAN, you do it using the install.packages function like this: install.packages("tidyverse")

library(tidyverse)

Referencing a library’s function with ::

Note that if we had not imported the tidyverse set of packages using library like above, and we were using a tidyverse function like read_tsv, we would need to tell R what package to find this function in. To do this, we would use :: to tell R to load in this function from the readr package by using readr::read_tsv. You will see this :: method of referencing libraries throughout the course.

Managing directories

Before we can import the data we need, we should double check where R is looking for files aka our what our working directory currently is. We can do this by using the getwd() function, which will tell us what folder we are in.

# Let's check what directory we are in:
getwd()
[1] "/home/rstudio/kitematic/training-modules/intro-to-R-tidyverse"

Note that for Rmd files, the working directory is always wherever the file is located.

We will want to make a directory for our output and we will call this directory: results. But before we create the directory, we should check if it already exists. We will show two ways that we can do this.

First, we can use the dir() function to have R list the files in our working directory.

# Let's check what files are here
dir()
 [1] "00a-rstudio_guide.md"           "00b-debugging_resources.md"    
 [3] "01-introduction-base-R.nb.html" "01-introduction-base-R.Rmd"    
 [5] "02-intro_to_ggplot2.nb.html"    "02-intro_to_ggplot2.Rmd"       
 [7] "03-intro_to_tidyverse.nb.html"  "03-intro_to_tidyverse.Rmd"     
 [9] "data"                           "diagrams"                      
[11] "metadata.tsv"                   "plots"                         
[13] "README.md"                      "refine.bio"                    
[15] "results"                        "screenshots"                   
[17] "scripts"                       

This shows us there is no folder called “results” yet.

If we want to more pointedly look for “results” in our working directory we can use the dir.exists function.

# Check if the results directory exists
dir.exists("results")
[1] TRUE

If the above says FALSE that means we will need to create a results directory using the function dir.create.

# Make a directory within the working directory called 'results'
dir.create("results")
'results' already exists

After creating the results directory above, let’s re-run dir.exists to see if now it exists.

# Re-check if the results directory exists
dir.exists("results")
[1] TRUE

We can use the output of dir.exists to automatically create or hold off on creating a directory by putting this together in an if statement like below. Note that we put an exclamation point to signify that we want a directory to be created only if dir.exists(results) is NOT equal to TRUE.

# If 'results' directory doesn't exist...
if (!dir.exists("results")) {
  # ... create a 'results' directory
  dir.create("results")
}

Note that the dir.exists function will not work on files themselves. In this case, there is an analogous function called file.exists.

Try using the file.exists function to see if the file gene_results_GSE44971.tsv file exists in the current directory. Use the code chunk we set up for you below. Note that in our notebooks (and sometimes elsewhere), wherever you see a <FILL_IN_THE_BLANK> like in the chunk below, that is meant for you to replace with the correct phrase before you run the chunk (otherwise you will get an error).

# Replace the <PUT_FILE_NAME_HERE> with the name of the file you are looking for
# Remember to use quotes to make it a character string
file.exists(<PUT_FILE_NAME_HERE>)

Now that we’ve determined that gene_results_GSE44971.tsv exists, we are ready to read it into our R environment.

Read a TSV file

In R, we will use file.path() for this (more on this later in this notebook).
Although base R has read-in functions, the readr functions are faster and more straightforward to use so we are going to use those here. Because the file we are reading in is a TSV (tab separated values) file we will be using the read_tsv function. But note that there are analogous functions for CSV (comma separated values) files (read_csv) and other files types.

Read in the differential expression analysis results file

Here we are using a tidyverse function read_tsv from a library of functions called readr.

stats_df <- readr::read_tsv(file.path(data_dir,
                                      "gene_results_GSE44971.tsv"))
Parsed with column specification:
cols(
  ensembl_id = col_character(),
  gene_symbol = col_character(),
  contrast = col_character(),
  log_fold_change = col_double(),
  avg_expression = col_double(),
  t_statistic = col_double(),
  p_value = col_double(),
  adj_p_value = col_double()
)
# The file we are reading in is our argument for read_tsv
gene_df <- read_tsv(file.path(data_dir, "GSE44971.tsv"))
Parsed with column specification:
cols(
  .default = col_double(),
  Gene = col_character()
)
See spec(...) for full column specifications.

Use this chunk to explore what gene_df looks like.

# Explore `gene_df`

What information is contained in gene_df?

dplyr pipes

A nifty thing with the tidyverse is the pipes: %>% This handy thing allows you to funnel the result of one expression to the next making your code a little more streamlined.

For example, the output from this:

filter(stats_df, contrast == "male_female")

…is the same as the output from this:

stats_df %>% filter(contrast == "male_female")

This makes it so your code is cleaner and easier to read. Let’s look at an example with our stats of of how the same functions look with or without pipes:

Example 1: without pipes:

stats_nopipe <- stats_df
stats_nopipe <- arrange(stats_nopipe, t_statistic)
stats_nopipe <- filter(stats_nopipe, avg_expression > 5)
stats_nopipe <- select(stats_nopipe, contrast, log_fold_change, p_value)

UGH, we have to repeat and assign stats_nopipe so many times here! It’s annoying and makes it harder for people to read.

Example 2: Same result as 1 but with pipes!

# Example of the same modifications as above but with pipes!
stats_pipe  <- stats_df %>%
               arrange(t_statistic) %>%
               filter(avg_expression > 5) %>%
               select(contrast, log_fold_change, p_value)

Let’s double check that these are the same by using the function, all.equal:

all.equal(stats_nopipe, stats_pipe)
[1] TRUE

all.equal is letting us know that these two objects are the same. (You can use all.equal in other instances and adjust the parameters depending on how exact of a match you are looking for. )

Now that hopefully you are convinced that tidyverse helps your code to be neater and easier to use and read, let’s go through some of the popular tidyverse functions and so we can create pipelines like this.

Common tidyverse functions

Let’s say we wanted to filter this gene expression dataset to particular sample groups. In order to do this, we would use the function filter as well as a logic statement.

# Here let's filter the data to a particular gene
stats_df %>% filter(gene_symbol == "SNCA")

We can use filter similarly for numeric statements.

# Here let's filter the data to have average expression values above 5
stats_df %>% filter(avg_expression > 5)

We can apply multiple filters at once:

stats_df %>% filter(contrast == "male_female", 
                    avg_expression > 5)

When we are filtering, the %in% operator can come in handy if we have multiple items we would like to match. Let’s take a look at what using %in% does.

stats_df$gene_symbol %in% c("SNCA", "CDKN1A")
   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [14] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [27] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [40] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [53] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [66] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [79] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [92] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [105] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [118] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [131] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [170] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [183] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [196] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [209] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [222] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [235] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [248] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [261] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [274] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [287] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [300] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [326] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [339] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [352] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [365] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [378] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [391] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [404] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [417] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [430] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [443] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [456] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [469] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [482] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [495] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [508] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [521] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [534] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [547] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [560] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [573] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [586] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [599] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [612] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [625] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [638] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [651] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [664] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [677] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [690] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [703] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [716] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [729] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [742] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [755] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [768] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [781] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [794] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [807] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [820] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [833] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [846] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [859] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [872] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [885] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [898] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [911] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [924] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [937] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [950] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [963] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [976] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [989] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [ reached getOption("max.print") -- omitted 1512 entries ]

%in% returns a logical vector that now we can use in dplyr::filter.

stats_df %>% filter(gene_symbol %in% c("SNCA", "CDKN1A"))

Let’s return to our first filter and build on to it. This time, let’s keep only some of these variables with the select function. Let’s also save this as a new data.frame called stats_filtered_df.

stats_filtered_df <- stats_df %>% 
  filter(contrast == "male_female", 
         avg_expression > 5) %>%
  select(log_fold_change, t_statistic)

Let’s say we wanted to arrange this dataset so that the patients were arranged youngest to oldest. In order to do this, we would use the function arrange as well as the variable we would like to sort by (in this case age_mths).

stats_df %>% arrange(p_value) 

What if we want the the oldest patients at the top of this dataset instead? We can use the same function, but instead use the desc function too.

stats_df %>% arrange(desc(p_value))

What if we would like to create new variables? For that we use mutate function.

stats_df %>% 
  mutate(log10 = -log10(p_value))

What if we want to obtain summary statistics for a variable? The summarize function allows us to calculate summary statistics for a column. Here we will use summarize to obtain an average expression overall all genes.

stats_df %>% summarize(mean(log_fold_change))

What if we’d like to obtain a summary statistics but have them for various groups? Conveniently named, there’s a function called group_by that seamlessly allows us to do this. Also note that group_by allows us to group by multiple variables at a time if you want to.

stats_summary_df <- stats_df %>%
      group_by(contrast) %>% 
      summarize(mean(log_fold_change))

Let’s look at a preview of what we made:

stats_summary_df

Chunking out your code: a technique for figuring out what’s happening

We did quite a few steps to get the object stats_summary_df. If we needed to troubleshoot this step, or if we got confused on what’s happening, it could be difficult to figure this out by looking at the code chunk all at once.

For these situations, we are going to introduce you to a technique called “chunking”.

Printing out smaller units of code at a time can help you learn what is happening at each step. This will help fast-track your troubleshooting attempts.

Use this empty code chunk below to try chunking out this code chunk by following the figure above.

# Try to chunk out lines of code we used to get `stats_summary_df`

For more troubleshooting techniques or a breakdown of the most common R errors, check out the debugging resources guide in this repository.

A brief intro to the apply family of functions

In base R there the apply family of functions can be an alternative methods for performing transformations across a data.frame, matrix or for other object structures.

The second argument in apply specifies whether we are applying something across rows or across columns (1 for rows, 2 for columns).

# Calculate row means
gene_means <- apply(gene_df[, -1], 1, mean) # Notice we are using 1 here

# How long will `gene_means` be? 
length(gene_means)
[1] 20056

Why do we have a [, -1] after gene_df in the above chunk? Now let’s investigate the same set up, but use 2 to apply over the columns of our matrix.

# Calculate sample means
sample_means <- apply(gene_df[, -1], 2, mean) # Notice we use 2 here

# How long will `sample_means` be? 
length(sample_means)
[1] 58

Although the apply functions may not be as easy to use as the tidyverse functions, for some applications, apply methods can be better suited. In this workshop, we will not delve too deeply into the various other apply functions (tapply, lapply, etc.) but you can read more examples about them here. ## The dplyr::join functions

Let’s say we have a scenario where we have two data.frames that we would like to combine. Recall that stats_pipe and gene_df are data.frames that contain information about some of the same genes. The dplyr::join family of functions are useful for various scenarios of combining data.frames.

For now, we will focus on inner_join, which will combine data.frames by only keeping information about the samples that are in both data.frames. We need to use the by argument to designate what column(s) should be used as a key to match the data.frames (in this case ensembl_ids).

stats_df %>% inner_join(gene_df, 
                        by = c('ensembl_id' = 'Gene')) 

Note that if there are columns of the same name in both data.frames, the default is to save both columns but a x or y is added to the name to note which data.frame it originated from.

Save data to files

Save to TSV files

Let’s write this data we edited to a file. To do this, we can use the readr library of write functions. Note that the second argument of write_tsv needs to be a character string that contains our file.path to the new file we would like to create. Remember that we created a results directory to put our output in. But if we want to save our data to a directory other than our working directory, we need to specify this. This is what we will use the file.path function for. Let’s look at what file.path does, by taking a look at the print out of it from examples below.

# Which of these file paths is what we want to use to save our data to the
# results directory we created at the beginning of this notebook?
file.path("docker-install", "stats_summary.tsv")
[1] "docker-install/stats_summary.tsv"
file.path("results", "stats_summary.tsv")
[1] "results/stats_summary.tsv"
file.path("stats_summary.tsv", "results")
[1] "stats_summary.tsv/results"

Replace <NEW_FILE_PATH> below with the file.path from above that will successfully save our file to the results folder

# Write our data.frame to a TSV file
write_tsv(stats_summary_df, <NEW_FILE_PATH>)

Check in your results directory to see if your new file has successfully saved.

Save to RDS files

For this example we have been working with data.frames, however, in other situations we may want to save more complicated structure objects or very large objects. RDS (R Data Serialized/Single) files may be a better option in these instances. RDS is R’s special file format for holding data exactly as you have it in your R environment. RDS files can also be compressed, meaning they will take up less space on your computer. Let’s save our data to an RDS file in our results folder. You will need to replace the .tsv with .RDS, but you can use what we determined as our file.path for the last chunk as your template.

# Write your object to an RDS file
write_rds(stats_summary, <PUT_CORRECT_FILE_PATH_HERE>)

Read an RDS file

Since now you have learned the readr functions: read_tsv, write_tsv, and now, write_rds, what do you suppose the function you will need to read your RDS file is called? Use that function here to re-import your data in the chunk we set up for you below.

# Read in your RDS file
reimport_df <- <PUT_FUNCTION_NAME>(file.path("results", "stats_clean.RDS"))

As is good practice, we will end this session by printing out our session info.

Session Info

# Print out the versions and packages we are using in this session
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2     readr_1.3.1    
[6] tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.0   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1           lubridate_1.7.4      lattice_0.20-38     
 [4] assertthat_0.2.1     zeallot_0.1.0        rprojroot_1.3-2     
 [7] digest_0.6.20        utf8_1.1.4           R6_2.4.0            
[10] cellranger_1.1.0     backports_1.1.4      stats4_3.6.0        
[13] RSQLite_2.1.1        evaluate_0.14        httr_1.4.0          
[16] pillar_1.4.2         rlang_0.4.0          lazyeval_0.2.2      
[19] readxl_1.3.1         rstudioapi_0.10      blob_1.1.1          
[22] S4Vectors_0.24.1     rmarkdown_1.13       labeling_0.3        
[25] bit_1.1-14           munsell_0.5.0        broom_0.5.2         
[28] compiler_3.6.0       modelr_0.1.4         xfun_0.8            
[31] pkgconfig_2.0.2      BiocGenerics_0.32.0  base64enc_0.1-3     
[34] htmltools_0.3.6      tidyselect_0.2.5     IRanges_2.20.1      
[37] fansi_0.4.0          crayon_1.3.4         withr_2.1.2         
[40] grid_3.6.0           nlme_3.1-140         jsonlite_1.6        
[43] gtable_0.3.0         DBI_1.0.0            magrittr_1.5        
[46] scales_1.0.0         cli_1.1.0            stringi_1.4.3       
[49] limma_3.42.0         xml2_1.2.0           vctrs_0.1.0         
[52] generics_0.0.2       org.Hs.eg.db_3.10.0  tools_3.6.0         
[55] bit64_0.9-7          Biobase_2.46.0       glue_1.3.1          
[58] hms_0.4.2            rsconnect_0.8.13     parallel_3.6.0      
[61] yaml_2.2.0           AnnotationDbi_1.48.0 colorspace_1.4-1    
[64] BiocManager_1.30.4   rvest_0.3.4          memoise_1.1.0       
[67] knitr_1.23           haven_2.1.1         
---
title: "Intro to ggplot2"
output:   
  html_notebook: 
    toc: true
    toc_float: true
---

**CCDL 2020**

## Objective for this notebook analysis: 

We'll use the same gene expression dataset we used in the previous notebook. 
It a pre-processed [astrocytoma microarray dataset](https://www.refine.bio/experiments/GSE44971/gene-expression-data-from-pilocytic-astrocytoma-tumour-samples-and-normal-cerebellum-controls) that we performed [differential expression analyses](./scripts/00-setup-intro-to-R.R).

**More tidyverse resources:**  
- [R for Data Science](https://r4ds.had.co.nz/)
- [Tidyverse documentation](https://dplyr.tidyverse.org/)
- [Cheatsheet of Tidyverse data transformation](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)
- [Online Tidyverse book chapter](https://privefl.github.io/advr38book/tidyverse.html)

**Concepts to cover (in order)** 
- Referencing a library's function with `::`
- dplyr::select
- dplyr::filter
- dplyr::mutate
- apply functions?
- Writing to RDS/TSV 

## Set Up

Declare the name of the directory where we will read in the data. 

```{r}
data_dir <- "data"
```

The `tidyverse` is a collection of packages that are handy for general data 
wrangling, analysis, and visualization. 
Other packages that are specifically handy for different biological analyses are 
found on [Bioconductor](https://www.bioconductor.org/).
If we want to use a package's functions we first need to install them.
In our Docker container, we already have `tidyverse` and other packages we will
use in this workshop installed for you. 
But if you needed to install it or other packages available on CRAN, you 
do it using the `install.packages` function like this: `install.packages("tidyverse")`

```{r Load tidyverse}
library(tidyverse)
```

### Referencing a library's function with `::`

Note that if we had not imported the tidyverse set of packages using `library` 
like above, and we were using a tidyverse function like `read_tsv`, we would need to 
tell R what package to find this function in.
To do this, we would use `::` to tell R to load in this function from the 
`readr` package by using `readr::read_tsv`.
You will see this `::` method of referencing libraries throughout the course.  

## Managing directories

Before we can import the data we need, we should double check where R is 
looking for files aka our what our *working* directory currently is. 
We can do this by using the `getwd()` function, which will tell us what folder
we are in. 

```{r}
# Let's check what directory we are in:
getwd()
```

Note that for Rmd files, the working directory is always wherever the file is 
located. 

We will want to make a directory for our output and we will call this directory: 
`results`. 
But before we create the directory, we should check if it already exists. 
We will show two ways that we can do this. 

First, we can use the `dir()` function to have R list the files in our working 
directory. 

```{r}
# Let's check what files are here
dir()
```

This shows us there is no folder called "results" yet. 

If we want to more pointedly look for "results" in our working directory we can 
use the `dir.exists` function.

```{r}
# Check if the results directory exists
dir.exists("results")
```

If the above says `FALSE` that means we will need to create a `results` directory
using the function `dir.create`.

```{r}
# Make a directory within the working directory called 'results'
dir.create("results")
```

After creating the results directory above, let's re-run `dir.exists` to see if 
now it exists.

```{r}
# Re-check if the results directory exists
dir.exists("results")
```

We can use the output of `dir.exists` to automatically create or hold off on 
creating a directory by putting this together in an `if` statement like below. 
Note that we put an exclamation point to signify that we want a directory to be
created only *if* `dir.exists(results)` is NOT equal to `TRUE`.

```{r}
# If 'results' directory doesn't exist...
if (!dir.exists("results")) {
  # ... create a 'results' directory
  dir.create("results")
}
```

Note that the `dir.exists` function will not work on files themselves.
In this case, there is an analogous function called `file.exists`.

Try using the `file.exists` function to see if the file `gene_results_GSE44971.tsv` file 
exists in the current directory.
Use the code chunk we set up for you below. 
Note that in our notebooks (and sometimes elsewhere), wherever you see a 
`<FILL_IN_THE_BLANK>` like in the chunk below, that is meant for you to replace 
with the correct phrase before you run the chunk (otherwise you will get an error).

```{r eval=FALSE}
# Replace the <PUT_FILE_NAME_HERE> with the name of the file you are looking for
# Remember to use quotes to make it a character string
file.exists(<PUT_FILE_NAME_HERE>)
```

Now that we've determined that `gene_results_GSE44971.tsv` exists, we are ready to read it 
into our R environment.

#### Read a TSV file

In R, we will use `file.path()` for this (more on this later in this notebook).  
Although base R has read-in functions, the `readr` functions are faster and more
straightforward to use so we are going to use those here. 
Because the file we are reading in is a TSV (tab separated values) file we will 
be using the `read_tsv` function. 
But note that there are analogous functions for CSV (comma separated values) 
files (`read_csv`) and other files types.

## Read in the differential expression analysis results file

Here we are using a `tidyverse` function `read_tsv` from a library of functions called `readr`. 

```{r}
stats_df <- readr::read_tsv(file.path(data_dir,
                                      "gene_results_GSE44971.tsv"))
```
```{r}
# The file we are reading in is our argument for read_tsv
gene_df <- read_tsv(file.path(data_dir, "GSE44971.tsv"))
```

Use this chunk to explore what `gene_df` looks like. 

```{r}
# Explore `gene_df`
```

What information is contained in `gene_df`?

## dplyr pipes

A nifty thing with the tidyverse is the pipes: `%>%`
This handy thing allows you to funnel the result of one expression to the next
making your code a little more streamlined.

For example, the output from this:  

```{r}
filter(stats_df, contrast == "male_female")
```  

...is the same as the output from this:  

```{r}
stats_df %>% filter(contrast == "male_female")
```  
  
This makes it so your code is cleaner and easier to read. 
Let's look at an example with our stats of of how the same 
functions look with or without pipes:

*Example 1:* without pipes: 

```{r}
stats_nopipe <- stats_df
stats_nopipe <- arrange(stats_nopipe, t_statistic)
stats_nopipe <- filter(stats_nopipe, avg_expression > 5)
stats_nopipe <- select(stats_nopipe, contrast, log_fold_change, p_value)
```
  
UGH, we have to repeat and assign `stats_nopipe` so many times here! 
It's annoying and makes it harder for people to read. 
  
*Example 2:* Same result as 1 but with pipes!

```{r}
# Example of the same modifications as above but with pipes!
stats_pipe  <- stats_df %>%
               arrange(t_statistic) %>%
               filter(avg_expression > 5) %>%
               select(contrast, log_fold_change, p_value)
```

Let's double check that these are the same by using the function, all.equal: 

```{r}
all.equal(stats_nopipe, stats_pipe)
```

`all.equal` is letting us know that these two objects are the same. 
(You can use all.equal in other instances and adjust the parameters depending on
how exact of a match you are looking for. )

Now that hopefully you are convinced that tidyverse helps your code to be neater 
and easier to use and read, let's go through some of the popular tidyverse 
functions and so we can create pipelines like this. 

## Common tidyverse functions

Let's say we wanted to filter this gene expression dataset to particular sample
groups.
In order to do this, we would use the function `filter` as well as a logic statement.

```{r}
# Here let's filter the data to a particular gene
stats_df %>% filter(gene_symbol == "SNCA")
```

We can use filter similarly for numeric statements.  

```{r}
# Here let's filter the data to have average expression values above 5
stats_df %>% filter(avg_expression > 5)
```

We can apply multiple filters at once:

```{r }
stats_df %>% filter(contrast == "male_female", 
                    avg_expression > 5)
```

When we are filtering, the `%in%` operator can come in handy if we have multiple
items we would like to match.
Let's take a look at what using `%in%` does.

```{r}
stats_df$gene_symbol %in% c("SNCA", "CDKN1A")
```

`%in%` returns a logical vector that now we can use in `dplyr::filter`.

```{r }
stats_df %>% filter(gene_symbol %in% c("SNCA", "CDKN1A"))
```

Let's return to our first `filter` and build on to it. 
This time, let's keep only some of these variables with the `select` function. 
Let's also save this as a new data.frame called `stats_filtered_df`.

```{r}
stats_filtered_df <- stats_df %>% 
  filter(contrast == "male_female", 
         avg_expression > 5) %>%
  select(log_fold_change, t_statistic)
```

Let's say we wanted to arrange this dataset so that the patients were arranged
youngest to oldest.
In order to do this, we would use the function `arrange` as well as the variable
we would like to sort by (in this case `age_mths`).

```{r}
stats_df %>% arrange(p_value) 
```

What if we want the the oldest patients at the top of this dataset instead? 
We can use the same function, but instead use the `desc` function too. 

```{r}
stats_df %>% arrange(desc(p_value))
``` 

What if we would like to create new variables? 
For that we use `mutate` function. 

```{r}
stats_df %>% 
  mutate(log10 = -log10(p_value))
```

What if we want to obtain summary statistics for a variable? 
The `summarize` function allows us to calculate summary statistics for a column. 
Here we will use summarize to obtain an average expression overall all genes. 

```{r}
stats_df %>% summarize(mean(log_fold_change))
```

What if we'd like to obtain a summary statistics but have them for various groups?
Conveniently named, there's a function called `group_by` that seamlessly 
allows us to do this. 
Also note that `group_by` allows us to group by multiple variables at a time if 
you want to.

```{r}
stats_summary_df <- stats_df %>%
      group_by(contrast) %>% 
      summarize(mean(log_fold_change))
```

Let's look at a preview of what we made:

```{r}
stats_summary_df
```

### Chunking out your code: a technique for figuring out what's happening

We did quite a few steps to get the object `stats_summary_df`. 
If we needed to troubleshoot this step, or if we got confused on what's 
happening, it could be difficult to figure this out by looking at the code chunk 
all at once. 

For these situations, we are going to introduce you to a technique called 
"chunking".

Printing out smaller units of code at a time can help you learn what is happening
at each step. 
This will help fast-track your troubleshooting attempts. 

Use this empty code chunk below to try chunking out this code chunk by following
the figure above. 

```{r}
# Try to chunk out lines of code we used to get `stats_summary_df`
```

For more troubleshooting techniques or a breakdown of the most common R errors, 
check out the [debugging resources guide](https://github.com/AlexsLemonade/training-modules/blob/master/intro-to-R-tidyverse/00b-debugging_resources.md)
in this repository.

## A brief intro to the `apply` family of functions

In base R there the `apply` family of functions can be an alternative 
methods for performing transformations across a data.frame, matrix or for other 
object structures. 

The second argument in `apply` specifies whether we are applying something 
across rows or across columns (1 for rows, 2 for columns).

```{r}
# Calculate row means
gene_means <- apply(gene_df[, -1], 1, mean) # Notice we are using 1 here

# How long will `gene_means` be? 
length(gene_means)
```

Why do we have a `[, -1]` after `gene_df` in the above chunk?
Now let's investigate the same set up, but use 2 to `apply` over the columns of
our matrix.

```{r}
# Calculate sample means
sample_means <- apply(gene_df[, -1], 2, mean) # Notice we use 2 here

# How long will `sample_means` be? 
length(sample_means)
```

Although the `apply` functions may not be as easy to use as the `tidyverse` 
functions, for some applications, `apply` methods can be better suited.
In this workshop, we will not delve too deeply into the various other apply 
functions (`tapply`, `lapply`, etc.) but you can read more examples 
about them [here](https://www.guru99.com/r-apply-sapply-tapply.html).
## The dplyr::join functions

Let's say we have a scenario where we have two data.frames that we would like to 
combine. 
Recall that `stats_pipe` and `gene_df` are data.frames that contain information about some of the same genes.
The [`dplyr::join` family of functions](https://dplyr.tidyverse.org/reference/join.html) 
are useful for various scenarios of combining data.frames. 

For now, we will focus on `inner_join`, which will combine data.frames by only
keeping information about the samples that are in both data.frames. 
We need to use the `by` argument to designate what column(s) 
should be used as a key to match the data.frames (in this case `ensembl_id`s). 

```{r}
stats_df %>% inner_join(gene_df, 
                        by = c('ensembl_id' = 'Gene')) 
```

Note that if there are columns of the same name in both data.frames, the default 
is to save both columns but a `x` or `y` is added to the name to note which 
data.frame it originated from.

## Save data to files

#### Save to TSV files

Let's write this data we edited to a file.
To do this, we can use the `readr` library of `write` functions. 
Note that the second argument of `write_tsv` needs to be a character string that 
contains our `file.path` to the new file we would like to create.
Remember that we created a `results` directory to put our output in. 
But if we want to save our data to a directory other than our working directory, 
we need to specify this. 
This is what we will use the `file.path` function for. 
Let's look at what `file.path` does, by taking a look at the print out of it 
from examples below.

```{r}
# Which of these file paths is what we want to use to save our data to the
# results directory we created at the beginning of this notebook?
file.path("docker-install", "stats_summary.tsv")
file.path("results", "stats_summary.tsv")
file.path("stats_summary.tsv", "results")
```

Replace `<NEW_FILE_PATH>` below with the `file.path` from above that will 
successfully save our file to the `results` folder 

```{r eval=FALSE}
# Write our data.frame to a TSV file
write_tsv(stats_summary_df, <NEW_FILE_PATH>)
```

Check in your `results` directory to see if your new file has successfully saved.

#### Save to RDS files

For this example we have been working with `data.frame`s, however, in other 
situations we may want to save more complicated structure objects or very large 
objects. 
RDS (R Data Serialized/Single) files may be a better option in these instances.
RDS is R's special file format for holding data exactly as you have it in your 
R environment. 
RDS files can also be compressed, meaning they will take up less space on your 
computer. 
Let's save our data to an RDS file in our `results` folder.
You will need to replace the `.tsv` with `.RDS`, but you can use what we 
determined as our `file.path` for the last chunk as your template. 

```{r eval=FALSE}
# Write your object to an RDS file
write_rds(stats_summary, <PUT_CORRECT_FILE_PATH_HERE>)
```

#### Read an RDS file

Since now you have learned the `readr` functions: `read_tsv`, `write_tsv`, and 
now, `write_rds`, what do you suppose the function you will need to read your 
RDS file is called? 
Use that function here to re-import your data in the chunk we set up for you
below.

```{r eval=FALSE}
# Read in your RDS file
reimport_df <- <PUT_FUNCTION_NAME>(file.path("results", "stats_clean.RDS"))
```

As is good practice, we will end this session by printing out our session info. 

### Session Info

```{r}
# Print out the versions and packages we are using in this session
sessionInfo()
```
